A practical assembly guideline for genomes with various levels of heterozygosity

Abstract Although current long-read sequencing technologies produce long reads that facilitate genome assembly, they have high sequencing error rates. While various assemblers with different perspectives have been developed, no systematic evaluation of assemblers with long reads for diploid genomes with varying heterozygosity has been performed. Here, we evaluated a series of processes, including the estimation of genome characteristics such as genome size and heterozygosity, de novo assembly, polishing, and removal of allelic contigs, using six genomes with various heterozygosity levels. We evaluated five long-read-only assemblers (Canu, Flye, miniasm, NextDenovo and Redbean) and five hybrid assemblers that combine short and long reads (HASLR, MaSuRCA, Platanus-allee, SPAdes and WENGAN) and proposed a concrete guideline for the construction of haplotype representation according to the degree of heterozygosity, followed by polishing and purging haplotigs, using stable and high-performance assemblers: Redbean, Flye and MaSuRCA.


INTRODUCTION
The advent of third-generation sequencing technologies, represented by Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (Nanopore), has allowed for very long de novo assemblies of complex genomes, including those of eukaryotic diploids [1]. These technologies produce long nucleotide sequence reads by reading long single-molecule nucleic acids [2]. According to the manufacturers' websites, the average read length and total amount of the read length for PacBio (Sequel, SMRT Cell 1 M) are 30 kb and 20 Gb, respectively, while those for Nanopore (MinION) are up to 4 Mb and 50 Gb, respectively. Therefore, they have been used for de novo assembly of challenging genomes with high heterozygosity [3], repetitive regions [4], large size [5] or high ploidy [6]. However, they produce higher sequencing error rates (5-15%) [7] than Illumina short reads (0.3%) [8]. Hence, most recent assemblers have applied the hierarchical approach to correct long-read errors by detecting sequence errors from alignments between long reads of the same sample before assembly [9]. Even after long-read-based assembly, the assembled sequences are further polished with Illumina short reads [3,10-13]. Several polishing tools, including Pilon, POLCA and NextPolish, correct sequence errors in the assembled sequences with long reads and/or Illumina short reads [14-16]. Recently, PacBio offered a high-fidelity (HiFi) read technology, which produces highly accurate (99.9%) reads with an average length of 13.5 kb [17] (https://www.pacb.com/technology/hifi-sequencing/). Consequently, more de novo genome assemblers for HiFi reads were developed [18,19]. However, since HiFi sequencing is costlier than standard long-read sequencing, genome assembly using conventional long-read technology is preferred [20-22].
Although de novo assemblers have been developed for diploids, including FALCON Unzip [32], Canu, Platanus-allee and Platanus [33], diploid genome assembly remains challenging. The contig set from de novo assembly, composed of one sequence pattern between homologous chromosomes and the sequence of the hemizygous region, is the desired haploid representation. However, highly heterozygous regions are not recognized as homologous regions of two chromosomes by any assemblers and are thus assembled separately. Therefore, the assembly size will be larger than the actual genome size [34]. Some tools that distinguish allelic sequences (haplotigs) from homologous regions have been developed to solve this problem, including Purge Haplotigs [34], purge_dups [35] and HaploMerger2 [36].
We evaluated assemblers based on computer resource usage (execution time and memory usage), continuity, and completeness using six genomes with various heterozygosity levels and proposed a concrete guideline for the construction of haplotype representation according to the degree of heterozygosity. The optimal genome-assembler combination is influenced by heterozygosity, repeats, genome size, as well as research purposes. Therefore, our guidelines are intended to help users select and further modify the best method to suit their genome characteristics and research purposes.

Collecting the datasets for evaluation
Since we focused on how heterozygosity affects genome assembly, we collected sets of genomes with varying levels of heterozygosity. The six available genomes of Arabidopsis thaliana accession C24 [10], Nitzschia putrida strain NIES-4239 [37], Lates calcarifer [38], Solanum sitiens [11], A. thaliana F1 cross of Col-0 and Cvi-0 [32] and Crassostrea gigas [3] were used. Their PacBio subreads and Illumina paired-end data of whole-genome sequencing were downloaded from the public archive Sequence Read Archive of European Nucleotide Archive (ENA) [39] (Table 1). The sequencing data of the six genomes have PacBio coverage ≥39x and Illumina coverage ≥60x. For the L. calcarifer PacBio subreads, 223x out of 247x were retrieved since they are sufficient to gain PacBio coverage >39x.

Analytical processes
The stepwise analytical processes for the construction of haploid representation comprised four processes: estimation of genome characteristics (such as genome size, heterozygosity, and repeat rate), de novo assembly, polishing, and purging of haplotigs (Figure 1). First, the genome characteristics were estimated by obtaining the k-mer counts through Jellyfish v2.2.10 [40] (-Cm 21 -s 1000000000) and using them to estimate the genome characteristics through GenomeScope [41] (k-mer_length = 21 and kmermax = 1000). The estimated genome characteristics were assessed by comparing the genome size with that in the original research. De novo assembly was then conducted. Illumina paired-end reads used for the hybrid assemblers were trimmed using fastp v0.20.0 [42] (--cut_front and --cut_tail options). Thereafter, the assembled sequences that were >500 bp in length were extracted using seqkit v0.15.0 [43]. To polish the extracted assembled sequences, Illumina paired-end reads trimmed in the previous step were mapped to the assembled sequences using bwa v2.2.1 [44], and the sequences were polished using Pilon v1.24 [14]. The manipulation of SAM/BAM files was performed using SAMtools v1.11 [45]. Then, the haplotigs were removed from the polished sequences through Purge Haplotigs v1.1.1 [34]. The read-depth cut-off parameters for the 'purge' command of Purge Haplotigs comprised three types, i.e. 'low cut-off' for removing assembly artefacts, 'midpoint' for specifying the boundary between the haploid and diploid peaks, and 'high cut-off' for removing repeats and organelles. These were set (Supplementary Table S1) by referring to the histograms of read-depth to contigs (Supplementary Figure S1) generated using the 'hist' command of Purge Haplotigs. Purge Haplotigs outputs three types of FASTA format files: haploid representation (primary sequences), haplotigs and artefacts (comprising assembly artefacts, organelle genome and repeat sequences). The details of the execution commands and configuration options of each tool during the de novo assembly and thereafter are described in the Supplementary Methods. The graph of cumulative length and Nx statistics for contigs were generated using R v4.0.5. Assembly ploidy, a metric that estimates the proportion of allelic sequences present in the assembled genome sequences, was calculated by dividing the total assembly length by the estimated genome size [8].
Platanus-allee generates phased sequences and consensus sequences that are not phased. The other assemblers generate consensus sequences. Herein, the consensus sequences were consistently used for comparison among assemblers. Additionally, Flye, MaSuRCA_F and Platanus-allee output scaffolds, whereas SPAdes and MaSuRCA_C output scaffolds and/or contigs. The other assemblers output contigs but not scaffolds. Therefore, for comparison, we utilized scaffolds from Flye, MaSuRCA_F, Platanus-allee, SPAdes and MaSuRCA_C, as well as contigs from the other tools. Hereafter, both scaffolds and contigs are referred to as 'contigs' without distinction.
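The assembly ploidy metric described above is a simple ratio, and a minimal Python sketch may make it concrete. The function name and the toy contig lengths are illustrative assumptions and are not taken from the study; the genome size would normally come from a k-mer-based estimate such as that of GenomeScope.

```python
def assembly_ploidy(contig_lengths, estimated_genome_size):
    """Total assembled length divided by the estimated haploid genome size.

    A value near 1 suggests a haploid representation; a value near 2
    suggests that both alleles were assembled separately.
    """
    if estimated_genome_size <= 0:
        raise ValueError("estimated_genome_size must be positive")
    return sum(contig_lengths) / estimated_genome_size


# Hypothetical toy values (bp), not data from the study.
contigs = [60_000_000, 50_000_000, 40_000_000]
print(assembly_ploidy(contigs, 100_000_000))  # 1.5
```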

Evaluation of de novo assemblers
The resultant contigs were evaluated based on continuity and completeness. Contig continuity was evaluated by N50, which is the length of the contig at which the cumulative length reaches 50% of the total contig size when contigs are added from longest to shortest. The statistics of assembled contigs were calculated using assembly-stats v1.0.1 (https://github.com/rjchallis/assembly-stats) and Merqury v1.3 [48]. The completeness was measured using Benchmarking Universal Single-Copy Orthologs (BUSCO) analysis v5.0.0 [49]. The BUSCO lineage datasets used were 'embryophyta_odb10' for A. thaliana C24 and F1, 'eukaryota_odb10' for N. putrida NIES-4239, 'actinopterygii_odb10' for L. calcarifer, 'solanales_odb10' for S. sitiens, and 'metazoa_odb10' for C. gigas. The N50 and BUSCO completeness of polished contigs were ranked for each genome as the best to third best and the worst to third worst. For BUSCO completeness, the worst rankings were assigned only when the percentage of complete BUSCOs (including 'single-copy' and 'duplicated') differed by 15 or more from the top value in each genome. The assembler that reached the timeout was ranked as the worst for both N50 and BUSCO completeness. The continuity and completeness scores for each tool were calculated separately for heterozygosity <1 and heterozygosity ≥1 by adding 3 to the best, 2 to second best, 1 to third best, −3 to worst, −2 to second worst, and −1 to third worst. Based on these scores, the performance of BUSCO completeness and N50 for each assembler was classified as 'high', 'medium' or 'low'. The thresholds for each classification were as follows. For heterozygosity <1: high ≥5, 4 ≥ medium ≥ −4 and low ≤ −5 for N50; high ≥5, 4 ≥ medium ≥1 and low ≤0 for BUSCO. For heterozygosity ≥1: high ≥3, 2 ≥ medium ≥ −2, and low ≤ −3 for N50; high ≥4, 3 ≥ medium ≥0 and low ≤ −1 for BUSCO. These classifications of the assemblers by computational resource usage, N50 and BUSCO completeness were used to select the assemblers in the guideline adapted to the degree of heterozygosity. Among the assemblers with similar evaluation, the assembler with the most stable performance for N50 and BUSCO completeness was adopted.
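The N50 definition and the rank-based scoring above can be sketched in Python as follows. This is an illustrative sketch only: the function names are hypothetical, ties are broken arbitrarily by sort order, and the special cases described in the text (the 15-point rule for the worst BUSCO rankings and the timeout rule) are deliberately omitted.

```python
def n50(lengths):
    """Length of the contig at which the cumulative length, summed from
    longest to shortest, first reaches 50% of the total contig size."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


def rank_points(values):
    """Map assembler -> points for one genome (higher metric is better).

    Best three get +3, +2, +1; worst three get -3, -2, -1; others get 0.
    Assumes at least six assemblers so the two groups do not overlap.
    """
    if len(values) < 6:
        raise ValueError("need at least six assemblers")
    ordered = sorted(values, key=values.get, reverse=True)
    points = {name: 0 for name in ordered}
    for name, p in zip(ordered[:3], (3, 2, 1)):
        points[name] = p
    for name, p in zip(reversed(ordered[-3:]), (-3, -2, -1)):
        points[name] = p
    return points


# (minimum score for 'high', maximum score for 'low'), as stated in the text.
THRESHOLDS = {
    ("N50", "<1"): (5, -5),
    ("BUSCO", "<1"): (5, 0),
    ("N50", ">=1"): (3, -3),
    ("BUSCO", ">=1"): (4, -1),
}


def classify(score, metric, het_group):
    """Classify a summed score as 'high', 'medium' or 'low'."""
    high_min, low_max = THRESHOLDS[(metric, het_group)]
    if score >= high_min:
        return "high"
    if score <= low_max:
        return "low"
    return "medium"


print(n50([10, 8, 6, 4, 2]))  # 8
```

In use, the per-genome points from `rank_points` would be summed per assembler within each heterozygosity group before calling `classify`.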
To measure computational usage under the same conditions, all de novo assemblers were run on the National Institute of Genetics supercomputer system medium nodes with 10 CPU cores (CPU: Intel Xeon Gold 6148 × 4, 80 core/node). To measure wall-clock time and memory usage, 'ru_wallclock' and 'maxvmem' reported by the qacct command of the Univa Grid Engine were used. The assemblers were classified as 'Light', 'Medium' or 'Heavy' according to their maximum values of the wall-clock time and memory usage. The classification thresholds for the maximum value of wall-clock time for each tool were Light <50 h, 50 h ≤ Medium <250 h and Heavy ≥250 h. The classification thresholds for the maximum value of memory usage for each tool were Light <50 GB, 50 GB ≤ Medium <400 GB and Heavy ≥400 GB. As a comprehensive evaluation, the assemblers were also classified by overall computational resource usage, combining the wall-clock time and memory usage. In this case, 'Light' was defined as a maximum wall-clock time <50 h with a maximum memory usage <300 GB, and 'Heavy' as a maximum wall-clock time ≥250 h or a maximum memory usage ≥400 GB. 'Medium' was defined as neither Light nor Heavy.
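The three classification rules above (wall-clock time, memory, and the combined evaluation) can be written as a short Python sketch. The function names are illustrative assumptions; the thresholds are those stated in the text, and the example uses the maximum values reported for Flye (90 h and 343 GB).

```python
def classify_time(hours):
    """Classify by maximum wall-clock time in hours."""
    if hours < 50:
        return "Light"
    if hours < 250:
        return "Medium"
    return "Heavy"


def classify_memory(gigabytes):
    """Classify by maximum memory usage in GB."""
    if gigabytes < 50:
        return "Light"
    if gigabytes < 400:
        return "Medium"
    return "Heavy"


def classify_overall(hours, gigabytes):
    """Combined classification: note the 300 GB bound used for 'Light'."""
    if hours >= 250 or gigabytes >= 400:
        return "Heavy"
    if hours < 50 and gigabytes < 300:
        return "Light"
    return "Medium"


print(classify_time(90), classify_memory(343), classify_overall(90, 343))
# Medium Medium Medium
```

Note that the combined rule is not simply the maximum of the two individual classes: a tool using between 50 and 300 GB in under 50 h is 'Medium' for memory but 'Light' overall.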
Furthermore, the contigs of Arabidopsis F1 hybrids were evaluated by comparing with the parental haploid sequences, Col-0 (TAIR10) and Cvi-0 [10], using QUAST v5.0.2 [50]. Genome fraction is the total number of reference bases aligned with contigs divided by the reference genome size, expressed as a percentage. NGA50 is similar to N50 but uses the alignment block length and reference genome length instead of the contig length and total contig size. The alignment block lengths are calculated by splitting contigs at misassembly breakpoints. Thus, NGA50 is the alignment block length at which the cumulative length reaches 50% of the total reference genome size.
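As a rough illustration of the NGA50 definition above, the following Python sketch computes it from a list of alignment block lengths. It assumes the blocks have already been produced by an aligner and split at misassembly breakpoints (as QUAST does internally); the function name and toy values are hypothetical.

```python
def nga50(aligned_block_lengths, reference_length):
    """Alignment block length at which the cumulative length, summed from
    longest to shortest, first reaches 50% of the reference genome size.

    Returns None when the aligned blocks cover less than half of the
    reference, in which case NGA50 is undefined.
    """
    half = reference_length / 2
    running = 0
    for length in sorted(aligned_block_lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return None


# Hypothetical toy values (bp), not data from the study.
print(nga50([400, 300, 200, 100], 1000))  # 300
print(nga50([200, 100], 1000))            # None
```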

Estimating genome characteristics
GenomeScope was used for all six genomes, and the estimated genome sizes were compared to the original research (Supplementary Table S2). The differences of estimated genome sizes from those of original research ranged from 6 to 27% in GenomeScope. The genome statistics estimated by GenomeScope were 0.055-3.00% heterozygosity, 32-906 Mb genome size and 2.8-43.9% repeats (Table 1).

Effect of polishing on assembly continuity
Contig statistics of assembly contigs and polished contigs are described in Supplementary Table S3 and Supplementary Table S4, respectively. Even after polishing, the total lengths obtained by almost all the assemblers did not change much from those before polishing (Supplementary Table S5). However, the largest contig length of miniasm for L. calcarifer decreased by 1.84% after polishing. Similarly, the total contig length of miniasm for N. putrida NIES-4239 decreased by 1.59% after polishing, and the N50 of miniasm for L. calcarifer decreased by 2.58%. Therefore, polishing has a certain impact on assembled contig lengths, and we compared the assembly ploidy as well as continuity among the assemblers with polished contigs.

Assembly ploidy and continuity after polishing
As heterozygosity increased, the assembly ploidy of each assembler also increased (Figure 2). The graphs of the cumulative length for contigs are shown in Supplementary Figure S2 and the concrete values of assembly size in Supplementary Table S4. As for the genomes with heterozygosity <0.5, including those of A. thaliana C24, N. putrida NIES-4239, and L. calcarifer, most assemblers show that the assembly ploidies are approximately 1, implying that the total contig sizes almost equal the estimated genome sizes. However, in the S. sitiens genome (heterozygosity, 0.847), the assembly ploidy in miniasm, Platanus-allee and Canu exceeded 1.5. In the A. thaliana F1 genome (heterozygosity, 1.04), miniasm and Canu showed an assembly ploidy of approximately 2. Furthermore, in the C. gigas genome (heterozygosity, 3.00), the assembly ploidy in miniasm, Flye, Canu, MaSuRCA_F and MaSuRCA_C exceeded 2, whereas that in Redbean, NextDenovo, and Platanus-allee exceeded 1.5. In the N. putrida NIES-4239 genome (heterozygosity, <0.5), SPAdes showed a much larger assembly ploidy of 12.90, whereas that of HASLR and WENGAN-M tended to be smaller overall.
The values of N50 are indicated in Figure 2; the graphs of the Nx statistics for contigs, in Supplementary Figure S3; and the concrete statistical values, including number of contigs, largest contig length, total length, N50, number of Ns and number of gaps, are listed in Supplementary Table S4. The continuity scores, which are obtained by ranking N50 of each genome and summing the values of each rank, calculated for genomes with a heterozygosity <1 and ≥1, are indicated in Tables 2A and B, respectively. Consequently, for genomes with a heterozygosity <1, Flye, NextDenovo and MaSuRCA_C were classified as 'high'; MaSuRCA_F, Platanus-allee, Redbean, miniasm, WENGAN_M and HASLR as 'medium'; and Canu and SPAdes as 'low'. For genomes with a heterozygosity ≥1, MaSuRCA_C, NextDenovo and Redbean were classified as 'high'; Canu, MaSuRCA_F, Flye, Platanus-allee and WENGAN_M as 'medium'; and HASLR, miniasm and SPAdes as 'low'. In general, MaSuRCA_C and NextDenovo provided better continuity regardless of heterozygosity in the given genomes, while Flye provided better continuity for genomes with a heterozygosity <1.

Effect of polishing on annotation completeness (BUSCO)
BUSCO completeness of assembly contigs and polished contigs are shown in Figures 3 and 4, respectively. The concrete BUSCO scores are indicated in Supplementary Table S6 and Supplementary Table S7. In particular for miniasm and Redbean, BUSCO completeness tended to improve significantly after polishing (Supplementary Table S8). For SPAdes, MaSuRCA_F and MaSuRCA_C, there was little change. Platanus-allee remained unchanged in all genomes. For HASLR, WENGAN-M, Flye, NextDenovo and Canu, the BUSCO completeness moderately improved after polishing. Since these results indicated that polishing the assembled contigs improved BUSCO completeness significantly, especially with miniasm and Redbean, BUSCO completeness among the assemblers was compared based on the results after polishing in the following section.

Effect of heterozygosity on annotation completeness (BUSCO)
The 'complete score' (C) column and 'complete and duplicate score' (D) column of BUSCO after polishing are indicated in Figure 2. BUSCO completeness after polishing is shown in Figure 4 and Supplementary Table S7. The genomes of A. thaliana C24 (heterozygosity, 0.055) and N. putrida NIES-4239 (heterozygosity, 0.336), which have lower heterozygosity, showed that their BUSCO completeness was not noticeably different among the assemblers, except for miniasm. For the other genomes, such as L. calcarifer (heterozygosity, 0.479), the BUSCO completeness was clearly different among the assemblers, from 7.9 in miniasm to 98.8 in MaSuRCA_F. The completeness score, which is obtained by ranking the scores for complete BUSCO genes of each genome and summing the values of each rank, was calculated separately for each tool for genomes with heterozygosity <1 and ≥1 (Tables 2A and B). Consequently, for genomes with a heterozygosity <1, Flye, MaSuRCA_F and MaSuRCA_C were classified as 'high'; SPAdes, Platanus-allee, Canu, Redbean and NextDenovo as 'medium'; and WENGAN-M, HASLR and miniasm as 'low'. For genomes with a heterozygosity ≥1, Canu and MaSuRCA_C were classified as 'high'; NextDenovo, MaSuRCA_F, Redbean, Flye and Platanus-allee as 'medium'; and SPAdes, WENGAN-M, miniasm and HASLR as 'low'. Across the various levels of heterozygosity, MaSuRCA_F, MaSuRCA_C, Flye and Canu provided stable and high BUSCO completeness. The complete and duplicated BUSCO scores were higher in most assemblers for genomes with a heterozygosity >0.5 (e.g. S. sitiens, A. thaliana F1, and C. gigas) than for the other genomes; that of Canu was particularly large.

Computational resource usages
Computational resource usages in the de novo assembly process are represented in Figure 5 and the concrete numeric values, in Supplementary Table S9. The maximum wall-clock time values exceeded 250 h for NextDenovo, SPAdes, Platanus-allee, Canu, MaSuRCA_F and MaSuRCA_C, whereas those of HASLR, WENGAN-M, Redbean and miniasm did not exceed 50 h. Flye had a maximum wall-clock time of 90 h. The wall-clock time of Canu for L. calcarifer exceeded 500 h and resulted in a 'timeout'. Consequently, based on the wall-clock times, HASLR, WENGAN-M, Redbean and miniasm were classified as 'Light'; Flye, as 'Medium'; and NextDenovo, SPAdes, Platanus-allee, Canu, MaSuRCA_F and MaSuRCA_C, as 'Heavy' (Table 2C).
The maximum value of memory usage exceeded 400 GB for SPAdes, Platanus-allee, Canu, MaSuRCA_F and MaSuRCA_C. Neither HASLR nor WENGAN-M used >50 GB of memory in any condition. The maximum memory usages of Redbean and miniasm were both <300 GB but >50 GB. The maximum memory usages for Flye and NextDenovo were 343 and 364 GB, respectively. Consequently, to evaluate memory usage, HASLR and WENGAN-M were classified as 'Light'; Redbean, miniasm, Flye and NextDenovo, as 'Medium'; and SPAdes, Platanus-allee, Canu, MaSuRCA_F and MaSuRCA_C, as 'Heavy' (Table 2C).
We then classified the de novo assemblers based on the computational resource usage comprising the wall-clock time and the maximum memory usage. HASLR, WENGAN-M, Redbean and miniasm were classified as 'Light'; Flye, as 'Medium'; and NextDenovo, SPAdes, Platanus-allee, Canu, MaSuRCA_F and MaSuRCA_C, as 'Heavy'. Hereinafter, assemblers are referred to as 'light-weight tool', 'medium-weight tool' or 'heavy-weight tool' according to the categories above based on their computational resource usage.

Assembly ploidy
After the execution of Purge Haplotigs, assembly ploidies in most of the assemblers were closer to 1 across the genomes (Figure 2 and Supplementary Figure S2). Although there was no assembly with ploidy over 2, only miniasm, Flye, Canu and MaSuRCA_C provided an assembly ploidy >1.5 for genomes with higher heterozygosities, such as for C. gigas (heterozygosity, 3.00), suggesting difficulty in haplotype removal for these genomes. The concrete statistical values for primary contigs, haplotigs, and artefacts are indicated in Supplementary Table S10.

Annotation completeness (BUSCO)
The output files from Purge Haplotigs (primary contigs, haplotigs, and artefacts) were examined for BUSCO completeness (Supplementary Figures S4-S6) to evaluate the validity of the purging process. The contig sets with higher complete and duplicated BUSCO scores were purged well (Figure 2). The highest BUSCO duplicated score before purging was 79.2% in Canu of C. gigas; after purging, this score drastically decreased to 12.1%. The BUSCO duplicated scores of C. gigas, which were higher than those of the other genomes in all assemblers, decreased after purging to at most 14.9% (MaSuRCA_C). In contrast, the BUSCO completeness of primary contigs was similar to that before the removal of haplotigs across the genomes (Supplementary Table S11). These results suggest that some might have been over-purged. To survey the over-purge, we examined (i) the number of BUSCO genes that were detected on haplotigs but not on primary contigs and (ii) the number of BUSCO genes that were detected on artefacts but not on primary contigs (Supplementary Table S12). Consequently, an over-purge >1% of the total BUSCO genes was not observed in any of the assembled genomes with HASLR, WENGAN-M, Flye, NextDenovo, MaSuRCA_F or MaSuRCA_C. Conversely, for genomes with a heterozygosity ≥0.847 (such as that for S. sitiens), an over-purge was observed in some assemblers. The maximum number of (i) was 160 in miniasm for A. thaliana F1 (9.9% of the overall BUSCO genes), and that of (ii) was 10 in Platanus-allee for A. thaliana F1 (0.6% of the overall BUSCO genes).
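The over-purge survey described above amounts to a set difference between the BUSCO genes found on a purged fraction (haplotigs or artefacts) and those found on the primary contigs. A minimal Python sketch is given below; it assumes the BUSCO gene identifiers detected on each fraction have already been extracted (for example from BUSCO result tables), and the function name and toy identifiers are hypothetical.

```python
def over_purged(primary_ids, purged_ids, total_busco_genes):
    """Return BUSCO genes found on a purged fraction but not on the
    primary contigs, and their share of all BUSCO genes in percent."""
    lost = set(purged_ids) - set(primary_ids)
    percent = 100.0 * len(lost) / total_busco_genes
    return lost, percent


# Hypothetical toy identifiers, not data from the study.
primary = {"busco1", "busco2", "busco3"}
haplotigs = {"busco2", "busco4"}
lost, percent = over_purged(primary, haplotigs, total_busco_genes=5)
print(sorted(lost), percent)  # ['busco4'] 20.0
```

Running the same function once with the haplotig set and once with the artefact set gives the two counts, (i) and (ii), that the survey reports.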

Comparison with Arabidopsis reference genome
To evaluate how effectively the above processes (assembly, polishing and purging) reconstructed the haploid sequences, the assembled A. thaliana F1 primary contigs were compared with the haploid sequences of Col-0 and Cvi-0, which are parent accessions (Supplementary Table S13). The genome fractions, NGA50 values, and the number of misassemblies between F1 and Col-0 and between F1 and Cvi-0 were comparable. For Redbean, Flye, SPAdes, Platanus-allee, Canu and MaSuRCA_C, the genome fractions were >90%, while they were 88% for MaSuRCA_F and approximately 80% for NextDenovo and WENGAN-M. However, in miniasm and HASLR, the genome fractions were 52% and <50%, respectively. The NGA50 values in Redbean, Flye, Canu, MaSuRCA_F and MaSuRCA_C were >130 kb; miniasm had the lowest NGA50 value (approximately 3000 bp). The NGA50 values of the remaining assemblers were approximately 8000-12 000 bp. HASLR has no NGA50 value because the total aligned length is <50% of the length of the parent reference genome. While the number of misassemblies was the smallest in HASLR (approximately 450), that in the others was 1750-4722, with that in Platanus-allee being the highest.

Analytical guideline
We devised a guideline to construct a haploid representation with PacBio long reads and Illumina short reads for diploid genomes with various levels of heterozygosity (Figure 6). This was based on the evaluation of the continuities and BUSCO scores for each process of assembling, polishing and purging haplotigs using various assemblers. First, sample properties such as genome size, heterozygosity and repeat rate are estimated using tools such as GenomeScope. For evaluating de novo assemblers, it is recommended to use only polished contigs after assembly. For genomes with any heterozygosity, the first recommended assembler is Redbean, a light-weight tool (Table 2C) with a stable performance regarding continuity and BUSCO completeness, regardless of heterozygosity (Table 2A and B). Redbean can provide a rough indication of the computational resource usage, continuity and BUSCO completeness to be expected when using additional assemblers. For genomes with a heterozygosity <1, Flye can be used as the second trial assembler because it is a medium-weight tool (Table 2C) classified as 'High' for both continuity and BUSCO completeness when heterozygosity is <1 (Table 2A). If more memory and execution time are available than Flye requires, MaSuRCA_C should be used because it is a heavy-weight tool (Table 2C) classified as 'High' for both continuity and BUSCO completeness when heterozygosity is <1 (Table 2A). For genomes with a heterozygosity ≥1, MaSuRCA_C should be used as the alternative second trial assembler because it is a heavy-weight tool (Table 2C) classified as 'High' for both continuity and BUSCO completeness and has a stable performance across the genomes with any heterozygosity (Table 2A and B). If MaSuRCA_C does not terminate successfully or the execution time is too long, it is better to use Flye as a medium-weight tool even for genomes with a heterozygosity ≥1. Flye is inferior to MaSuRCA_C in both continuity and BUSCO completeness (Table 2B), but it provides stable results with lower computational resources than MaSuRCA_C does (Table 2C). If two or more assemblers are used, their continuity and BUSCO completeness must be compared. Finally, removal of duplicated haploid sequences is performed using tools such as Purge Haplotigs. This process is more important for genomes with a higher heterozygosity. After purging, the results need to be verified, and manual curation is required.
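The assembler-selection part of the guideline can be summarized as a small decision function. The Python sketch below is one possible reading of the text rather than a definitive implementation: the function name and the boolean flag `heavy_resources_ok` (meaning that a heavy-weight run is feasible in terms of memory and time) are illustrative assumptions.

```python
def recommend_assemblers(heterozygosity, heavy_resources_ok):
    """Return an ordered list of assemblers to try.

    heterozygosity: estimated heterozygosity in percent (e.g. from GenomeScope).
    heavy_resources_ok: True if a heavy-weight tool can be run to completion.
    """
    plan = ["Redbean"]  # first trial for any heterozygosity (light-weight)
    if heterozygosity < 1:
        plan.append("Flye")  # second trial (medium-weight)
        if heavy_resources_ok:
            plan.append("MaSuRCA_C")  # optional heavy-weight trial
    else:
        # MaSuRCA_C is preferred; Flye is the fallback when it is not feasible.
        plan.append("MaSuRCA_C" if heavy_resources_ok else "Flye")
    return plan


print(recommend_assemblers(0.3, heavy_resources_ok=False))
# ['Redbean', 'Flye']
print(recommend_assemblers(3.0, heavy_resources_ok=True))
# ['Redbean', 'MaSuRCA_C']
```

Whichever assemblers are run, their polished contigs would still be compared on continuity and BUSCO completeness before purging haplotigs.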

DISCUSSION
Herein, we evaluated the procedures for constructing a haploid representation from PacBio long reads and Illumina short reads by focusing on heterozygosity and suggested an analytical guideline adapted to the degree of heterozygosity. The guideline includes: (i) estimation of genome information, including genome size and heterozygosity, (ii) de novo assembly, (iii) polishing and (iv) removal of duplicated haploid sequences. Five long-read-only and five hybrid assemblers were assessed for computer resource usage, contig continuity and BUSCO completeness (Table 2). Contig continuity and BUSCO completeness were separately evaluated for genomes with heterozygosities <1 and for those with heterozygosities ≥1. Subsequently, we selected three high-performance assemblers: Redbean from light-weight tools, Flye from medium-weight tools and MaSuRCA_C from heavy-weight tools. These assemblers were incorporated into the analytical guideline (Figure 6).
We focused on heterozygosity and proposed recommended assemblers. Nevertheless, the best genome-assembler combination would be affected by heterozygosity, repeats, genome size and other factors that cannot be determined without practically testing them. Thus, it is essential to understand the features of assemblers and perform a selective trial. We recommend checking the following items to choose the best assembly contigs: (i) BUSCO completeness, (ii) continuity and (iii) comparison between contig size after purging haplotigs and estimated genome size. For (i), this study conducted polishing once. However, iterative polishing may be effective if time permits, particularly for Redbean and miniasm, which showed significant polishing effects. A comparison of assemblers sometimes shows a similar BUSCO completeness

Figure 5. Computational resource usage of de novo assemblers. Canu for Lates calcarifer timed out. See Supplementary Table S9 for actual wall-clock time and memory. (A) Wall-clock time in hours. Two horizontal dashed lines indicate 50 and 250 h. (B) Memory usage. Three horizontal dashed lines indicate 50, 300 and 400 GB.

Figure 6. Recommended analytical guideline. This guideline consists of four processes: sample properties estimation, de novo assembly, polishing and purging haplotigs. De novo assemblers are recommended separately for heterozygosities <1 and ≥1.

Table 1. Summary of input datasets